1. INTRODUCTION

Advances in artificial intelligence and machine learning have enabled the wide development of
automated tools for answering customers’ queries, collecting surveys, and addressing complaints
without human involvement. These tools are usually chatbots [1-6] or, more advanced, voicebots
[3, 7-9]. Voicebots require a text-to-speech (TTS) engine that converts the answer text to speech and
plays it back to the customer during a call. A TTS conversion usually involves two steps:
(i) converting text to a mel-spectrogram; and (ii) synthesizing a waveform from the mel-spectrogram [10].

Recently introduced end-to-end [11], neural network-based TTS models [12, 13] include Tacotron [14],
Tacotron-2 [15-17], Es-Tacotron-2 [15], WaveNet [18-20], and WaveGlow [21]. In [14], a TTS engine
based on the Tacotron model was introduced to generate speech at the frame level, which enabled faster
speech synthesis than sample-level approaches. The model was trained on approximately 25 hours of
recorded speech from a single professional female speaker; thus, the quality of the input audio was
consistent and variation was minimal. The input audio had a sampling rate of 24 kHz, and training took
up to 2 million steps. Reducing the sampling rate may make it possible to reduce the number of training
steps. For synthesizing the waveform, the Griffin-Lim algorithm (introduced in 1984 [22] and still
attracting attention to date [23]) was used [14]. To further improve the Tacotron-2 model by addressing
the over-smoothness problem, which results in unnatural generated speech, Y. Liu and J. Zheng [15]
proposed adding an Es-Network to the existing model.
The idea was to make the generated speech more natural by employing the Es-Network to calculate the
estimated mel-spectrogram residual and making this an additional task of the Tacotron-2 model.
N. Li et al. [16] improved the training speed of the Tacotron-2 model by replacing its attention
mechanism with a multi-head one, inspired by the transformer network used in neural machine
translation. However, the drawback of this approach is that it used text-to-phoneme conversion for
processing data to learn English, which departs from the original end-to-end idea of the TTS engines
proposed for Tacotron [19] and Tacotron-2. Although using WaveNet for synthesizing speech may improve
speech quality [18, 24-27], such a system needs to train two separate networks, one for converting text
to a mel-spectrogram and the other for synthesizing speech from the mel-spectrogram [20]. A WaveNet
variant, WaveGlow [21], also required up to 580,000 training steps on a similar dataset, with audio
files sampled at 16 kHz. For synthesizing audio waveforms from mel-spectrograms on a very large audio
dataset (i.e., 960 hours from 2,484 speakers), a multi-head convolutional neural network was
proposed [28]. However, its performance with a low number of heads (i.e., 2) was just slightly above
average. Although [29, 30] also attempted to work on very large audio datasets using the proposed
Deep Voice models, the results obtained were not as competitive as those of Tacotron-2.

As seen from the above analysis, the developed engines mainly support English and Chinese, the most
popular languages in the world, whereas Vietnamese is not yet supported. Although local TTS tools
[31, 32] support Vietnamese well, there is little information about their back-end engines. In
addition, among the developed models, Tacotron and Tacotron-2 are the most utilized end-to-end TTS
engines, yet they lack support for Vietnamese. Therefore, this work presents the first open approach
for tailoring a Tacotron-2-based TTS engine using the FPT open speech dataset (FOSD) [31, 33, 34]. To
the best of the author’s knowledge, this work is the first attempt to utilize the publicly available
FOSD. The main contributions of this work are:

- A newly developed cleaner supporting Vietnamese speech generation with the TTS back-end engine
provided by Mozilla [35]
- The utilization of the publicly available FOSD dataset [34] for generating Vietnamese speech from text
- The method and analysis of a TTS model trained (up to 225,000 steps) for generating Vietnamese
speech [9, 36]

The remainder of this paper is organized as follows: section 2 details the method; section 3 discusses
the results obtained; section 4 concludes this research.


2. RESEARCH METHOD 

This section presents the overall research method, shown in Figure 1. First, the approach for
processing the dataset is presented. Second, the core settings of the Tacotron-2 engine used for
training and testing are outlined, making it easier for readers to investigate the proposed approach
further. Third, the role of the developed Vietnamese cleaners as part of the TTS engine is described to
help readers better understand the differences between English and Vietnamese texts. Next, information
about the trained model is presented to show how much effort was needed to train it and under which
conditions it was trained. Finally, the approach for creating the input data (Vietnamese texts) is
shown, covering the various test cases conducted in this work.
Figure 1. Overall research method


2.1. Dataset processing

The dataset contains over 25,000 audio files (approximately 30 hours of recordings) in Vietnamese,
separated into two main subsets [33, 34]. All audio files are in a compressed format (i.e., *.mp3),
while their transcripts are stored in *.txt files within the same subfolders. The audio bitrate is
64 kbps. To feed these audio files into the Mozilla-based TTS engine, they were all converted into
*.wav format with a bitrate of 352 kbps using the SoX toolbox [37]. In addition, all the audio files
were placed together in one folder for training the model. The transcript files were also compiled into
one file, in which each line follows the style:
audio_file_name|transcript|speech_start_time_1-speech_end_time_1 speech_start_time_2-speech_end_time_2

Here, audio_file_name is the file name including the extension; transcript is the text spoken in the
audio; each speech duration is marked by its two ends (i.e., speech_start_time_1-speech_end_time_1);
if there are multiple speech segments in one file, the durations are separated by a space character.
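
The line format above can be read back with a few lines of Python. The following is only a minimal
sketch: the "|" field separator, the interpretation of the times as decimal numbers, and the names
MetadataEntry and parse_metadata_line are assumptions made for illustration and are not taken from
the guidelines in [38].

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MetadataEntry:
    """One line of the compiled transcript file (hypothetical container)."""
    audio_file_name: str
    transcript: str
    segments: List[Tuple[float, float]]  # (start, end) pairs, one per speech segment


def parse_metadata_line(line: str) -> MetadataEntry:
    """Split a line into file name, transcript and speech segments.

    Assumes the three fields are separated by "|" and that each segment is
    written as "start-end", with segments separated by spaces.
    """
    name, transcript, spans = line.rstrip("\n").split("|", 2)
    segments = []
    for span in spans.split():
        start, end = span.split("-")
        segments.append((float(start), float(end)))
    return MetadataEntry(name, transcript, segments)
```

For example, parse_metadata_line("a.wav|xin chào|0.0-1.5 2.0-3.25") yields an entry with two
segments, (0.0, 1.5) and (2.0, 3.25).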

The transcript file was then separated into two *.csv files for training and testing the engine. The
training file consisted of 23,000 transcript lines while the testing file consisted of 1,900 transcript
lines. Detailed step-by-step guidelines for this data processing can be found in [38].
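
The 352 kbps figure quoted for the converted *.wav files is consistent with uncompressed PCM audio at
the 22,050 Hz sampling rate listed in Table 1. The sketch below checks this arithmetic and builds a
possible SoX command line. The 16-bit mono output and the helper names are assumptions made for
illustration; the exact command used in this work is the one documented in [38].

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int = 16, channels: int = 1) -> float:
    """Bitrate of uncompressed PCM audio in kbps (sample rate x bit depth x channels)."""
    return sample_rate_hz * bits_per_sample * channels / 1000


def sox_convert_command(src_mp3: str, dst_wav: str, sample_rate_hz: int = 22050) -> list:
    """Build a SoX argument list that resamples to mono 16-bit WAV.

    -r, -c and -b are SoX format options applied to the output file that follows them.
    The 16-bit mono setting is an assumption, not a value stated in this paper.
    """
    return ["sox", src_mp3, "-r", str(sample_rate_hz), "-c", "1", "-b", "16", dst_wav]


# 22,050 Hz x 16 bit x 1 channel = 352,800 bit/s, i.e. about 352 kbps.
print(pcm_bitrate_kbps(22050))
```

The returned list can be passed to subprocess.run(cmd, check=True) on a machine where SoX is installed
with MP3 support.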


2.2. Tacotron-2 architecture settings 

In this work, the Tacotron-2 architecture based on [19] was utilized since it provides better output
quality than the Tacotron architecture, as recommended in Mozilla’s notes to developers in [35].
Table 1 presents the configuration of the important parameters for training the model. In this table,
the number of mel-spectrogram channels was 80 and the number of short-time Fourier transform (STFT)
frequency levels (equal to the size of a linear spectrogram frame) was 1,025, the same as the default
value. The sampling rate was set to 22,050 Hz for faster training of the Tacotron-2 architecture.
Since the model used in this work was Tacotron-2, the softmax function was used for calculating the
attention norm, as suggested by Mozilla. The complete configuration of the TTS engine can be found
in [9].


Table 1. TTS engine configuration parameters for training

Parameter                   Value
num_mels                    80
num_freq                    1,025
sample_rate                 22,050 Hz
Model                       Tacotron2
attention_norm              softmax
min_seq_len                 6 -> 10
max_seq_len                 150 -> 100
use_phonemes                false
text_cleaner                Vietnamese_cleaners
datasets.name               fptopenspeechdata
datasets.path               /content/drive/My Drive/FPTOpenSpeechData/
datasets.meta_file_train    metadata_train.csv
datasets.meta_file_val      metadata_val.csv


In addition, the minimum and maximum sequence lengths were changed from 6 to 10 and from 150 to 100,
respectively, after the first 100,000 training steps. This was done to make the model converge faster
and to better suit the existing dataset, which has a minimum sequence length of 2, a maximum sequence
length of 301, and an average sequence length of 52.43. As a result, 1,145 instances were discarded
since they fell outside the aforementioned sequence length range. The phoneme option was disabled
since it was outside the focus of this research. Meanwhile, a new text cleaner named
“Vietnamese_cleaners” was developed for processing Vietnamese texts. The dataset path and the meta
files for training and validation were provided as well. It should be noted that the model was trained
entirely on Google Colaboratory, a free TensorFlow-supported platform.
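
To illustrate how the min_seq_len and max_seq_len settings lead to discarded instances, the following
sketch filters (transcript, audio file) pairs by transcript length. It assumes that the sequence length
is the number of characters in the transcript and that both bounds are inclusive; the function name is
hypothetical and this is not the filtering code of the TTS engine itself.

```python
from typing import List, Tuple


def filter_by_sequence_length(
    items: List[Tuple[str, str]], min_seq_len: int, max_seq_len: int
) -> Tuple[List[Tuple[str, str]], int]:
    """Keep (transcript, audio_file) pairs whose transcript length is within range.

    Returns the kept pairs and the number of discarded pairs.
    Assumption: length is counted in characters, bounds are inclusive.
    """
    kept = [
        (text, audio)
        for text, audio in items
        if min_seq_len <= len(text) <= max_seq_len
    ]
    return kept, len(items) - len(kept)


# Settings used after the first 100,000 training steps (Table 1).
samples = [("ừ", "a.wav"), ("một con vịt to như con bò", "b.wav")]
kept, discarded = filter_by_sequence_length(samples, min_seq_len=10, max_seq_len=100)
print(len(kept), discarded)
```

Tightening the range removes very short and very long utterances, which is the mechanism behind the
1,145 discarded instances reported above.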


2.3. Vietnamese cleaners 
The Vietnamese cleaners were developed to support the Vietnamese language instead of English as in
the original repository. The cleaners allow the following special conversions:
- symbols to words: e.g., “+” to “cộng” (English: plus)
- special characters to words: e.g., “%” to “phần trăm” (English: percent)
- special words to similar words with the same pronunciation: e.g., “hy” to “hi” (English: happy)
- numbers to words: e.g., “11” to “mười một” (English: eleven)
Here, it should be noted that all capitalized words were converted to lowercase to form uniform
source texts before being fed to the network for training, validation, and testing.
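
The conversions listed above can be sketched as a small text-normalization function. This is an
illustrative sketch only, not the cleaner developed in this work: the function names and mapping tables
are hypothetical, only the four example conversions from the list are covered, and number reading is
limited to 0-99.

```python
import re

UNITS = ["không", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"]
SYMBOLS = {"+": "cộng", "%": "phần trăm"}  # symbols and special characters to words
WORDS = {"hy": "hi"}                        # special words to same-sounding words


def number_to_words(n: int) -> str:
    """Read an integer from 0 to 99 in Vietnamese."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only covers 0-99")
    if n < 10:
        return UNITS[n]
    tens, unit = divmod(n, 10)
    if tens == 1:
        head = "mười"
        tail = {0: "", 5: "lăm"}.get(unit, UNITS[unit])
    else:
        head = UNITS[tens] + " mươi"
        # After "mươi", 1 is read "mốt" and 5 is read "lăm".
        tail = {0: "", 1: "mốt", 5: "lăm"}.get(unit, UNITS[unit])
    return (head + " " + tail).strip()


def _replace_number(match: "re.Match") -> str:
    digits = match.group()
    # Longer numbers are left untouched in this sketch.
    return number_to_words(int(digits)) if len(digits) <= 2 else digits


def vietnamese_cleaners_sketch(text: str) -> str:
    """Lowercase the text, then expand symbols, numbers and special words."""
    text = text.lower()
    for symbol, word in SYMBOLS.items():
        text = text.replace(symbol, f" {word} ")
    text = re.sub(r"\d+", _replace_number, text)
    # Splitting and re-joining also collapses repeated whitespace.
    return " ".join(WORDS.get(token, token) for token in text.split())
```

Lowercasing is applied first so that the later lookups only need lowercase keys, which mirrors the
note above that all capitalized words were converted to lowercase.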
2.4. Training model

To verify that the developed Vietnamese cleaners enable the model to generate clear Vietnamese speech
from random texts, the model was trained for 225,000 steps. At the end of training, the training loss
was 0.10406 while the validation loss was 0.12349.


2.5. Random texts for speech generation 

Table 2 lists the uncorrelated random texts selected for testing the trained TTS model. The first text
is an unusual statement comparing the sizes of “one” duck and a cow. In this text, the word “một”
(English: “one”) was used to test whether the trained model could generate speech containing a number.
The second text is a statement describing a female named “sơn”; here, the letter “s” was not
capitalized. The third text describes two footballers being invited to Spain for career probation. The
fourth text describes Hanoi streets during spring, near the Vietnamese Lunar New Year. The fifth text
describes how Vietnamese football stars spend money.


Table 2. Uncorrelated texts for testing model

No.  Input text (Vietnamese)                       Translated version in English
1    một con vịt to như con bò                     one duck is as big as a cow
2    chị sơn xinh gái nhỉ                          sister son is beautiful
3    Không chỉ có Tuấn Anh, Văn Toàn cũng          Not only Tuan Anh, Van Toan (footballer) was also
     được mời sang thử việc tại Tây Ban Nha        invited for career probation in Spain
4    Đào xuống phố sớm, nhiều tuyến đường          Peach blossoms come to the streets early, many streets
     Hà Nội đã rộn ràng sắc xuân                   in Hanoi are already bustling with spring
5    Sao bóng đá Việt đua nhau tặng xế sang        Vietnamese football stars follow each other in buying
     bạc tỷ cho người thân                         billion-dong luxury cars for their relatives


3. RESULTS AND DISCUSSION 

This section discusses the results obtained from the trained Vietnamese TTS model. First, the
generated speeches are assessed based on their completeness, which indicates whether the model is able
to generate complete speeches from the given texts. Second, the speeches are assessed based on their
clearness and naturalness using the mean opinion score (MOS), the typical index for assessing the
quality of speech generated by a TTS engine.


3.1. Completeness of the generated speeches 

Out of the five generated speeches, three (the first, the second, and the fifth) were complete. The
third speech missed 2/17 words while the fourth missed 10/14 words (i.e., the second part of the
sentence, after the comma). To analyze the missing words further, Table 3 presents their frequencies
in the training and validation sets used for training and validating the developed FOSD model. From
the table, it can be seen that the ratio of validation to training occurrences typically ranged from
approximately 0.05 to 0.14, with “sắc” as a low outlier at 0.025.

These figures suggest that too few occurrences in the validation set could cause missing words in the
generated speech, e.g., only 2 occurrences each for the words “sắc” and “xuân”. In addition, too many
occurrences could also cause the same issue, e.g., 1,038, 1,829, and 3,056 occurrences in the training
set for the words “nhiều”, “đã”, and “Hà”, respectively.


Table 3. Frequencies of missing words

No.  Word    Training  Validation  Ratio Validation/Training
1    Văn     167       20          0.1197
2    Toàn    267       23          0.0861
3    nhiều   1,038     81          0.0780
4    tuyến   57        8           0.1404
5    đường   395       31          0.0785
6    Hà      3,056     259         0.0848
7    Nội     166       9           0.0542
8    đã      1,829     125         0.0683
9    rộn     149       15          0.1007
10   ràng    49        7           0.1429
11   sắc     80        2           0.0250
12   xuân    35        2           0.0571
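
The counts and ratios in Table 3 can be reproduced from the training and validation transcripts with a
word counter. The sketch below is an assumption about how such counts might be obtained, not the script
used in this work: it lowercases the text, splits on whitespace, and strips ASCII punctuation from each
token, and its function names are hypothetical.

```python
import string
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def word_counts(lines: Iterable[str]) -> Counter:
    """Count lowercase, punctuation-stripped tokens over all transcript lines."""
    counts = Counter()
    for line in lines:
        tokens = (token.strip(string.punctuation) for token in line.lower().split())
        counts.update(token for token in tokens if token)
    return counts


def validation_to_training_ratio(
    train_lines: Iterable[str], val_lines: Iterable[str], words: Iterable[str]
) -> Dict[str, Tuple[int, int, Optional[float]]]:
    """Return (training count, validation count, validation/training ratio) per word."""
    train, val = word_counts(train_lines), word_counts(val_lines)
    result = {}
    for word in words:
        key = word.lower()
        t, v = train[key], val[key]
        result[word] = (t, v, v / t if t else None)  # no ratio if the word is unseen in training
    return result


# The ratio column of Table 3 is simply validation / training, e.g. for "sắc":
print(round(2 / 80, 4))
```

Applied to the metadata_train.csv and metadata_val.csv transcripts, such a function would list each
missing word with its two counts and their ratio.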
3.2. Clearness and naturalness of the generated speeches

A crowd-sourced survey was conducted with 100 randomly selected participants, all students at FPT
University, to assess the clearness and naturalness of the generated speeches. Here, naturalness refers
to how natural (human-like) the generated speech sounds, while clearness indicates its clarity (low
noise). According to the survey, 50% of the students used headphones while the other 50% used computer
speakers for the test. In addition, none of the students had heard the sentences or speeches before.
Their MOS are outlined in Table 4. From the table, the mean MOS for clearness ranged from 2.62 to 3.39.
Four out of five speeches were considered clear, while the second speech was the least clear. The
clearest speech was the fifth; its MOS of 3.39 with a standard deviation of 0.98 made it the best
speech in the test set. Meanwhile, the MOS for naturalness were slightly lower than those for
clearness. Still, the fifth speech was the most natural in the test set. Here, three out of five
speeches were clearly above the average (about 2.50), while the first and second were only marginally
above it (2.54 and 2.52).


Table 4. Clearness and naturalness of generated speeches (MOS, mean ± standard deviation)

No.  Clearness     Naturalness
1    2.95 ± 1.15   2.54 ± 1.12
2    2.62 ± 1.17   2.52 ± 1.07
3    2.94 ± 1.07   2.84 ± 1.00
4    2.97 ± 1.17   2.81 ± 1.02
5    3.39 ± 0.98   3.06 ± 1.07
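
Each entry in Table 4 is a mean and a standard deviation over the participants' ratings of one speech.
A minimal sketch of that computation is given below. The use of the sample standard deviation is an
assumption, since the paper does not state which estimator was used, and the function names are
hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def mos_summary(ratings: Sequence[float]) -> Tuple[float, float]:
    """Return (mean opinion score, sample standard deviation), rounded to 2 decimals.

    Requires at least two ratings, since the sample standard deviation
    is undefined for a single value.
    """
    return round(mean(ratings), 2), round(stdev(ratings), 2)


def format_mos(ratings: Sequence[float]) -> str:
    """Format ratings in the 'mean ± standard deviation' style of Table 4."""
    score, spread = mos_summary(ratings)
    return f"{score:.2f} ± {spread:.2f}"


# Toy ratings for illustration only; these are not survey data from this work.
print(format_mos([1, 2, 3, 4, 5]))
```

With 100 ratings per speech, the same function would produce one cell of the table per quality
dimension.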


4. CONCLUSION 

This paper has presented the first approach for building a Tacotron-2-based TTS engine for Vietnamese
using FOSD. The work offers new insights into the generation of speech from text. In particular, too
few or excessively many occurrences of words in the training and validation sets could cause those
words to be missing from the generated speech. Overall, all the generated speeches were above the
average in terms of clearness and naturalness. Future work will explore further possibilities for
generating quality speech from an optimal dataset.